{
 "cells": [
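  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 5.1: Chatbot\n",
    "\n",
    "You can use BigDL-LLM to load any Hugging Face *transformers* model and accelerate it on your laptop. With BigDL-LLM, PyTorch models hosted on Hugging Face (in FP16/BF16/FP32 format) can be loaded and automatically optimized with low-bit quantization (supported precisions include INT4/NF4/INT5/INT8).\n",
    "\n",
    "This notebook walks through the `transformers`-style API of BigDL-LLM in detail. In Section 5.1.2, you will learn how to load a transformers model in different situations. Section 5.1.3 guides you through building a chatbot with the loaded model. You will start from a simple framework and then gradually add features such as history management (for multi-turn chat) and streaming display.\n",
    "\n",
    "## 5.1.1 Install BigDL-LLM\n",
    "\n",
    "First, install BigDL-LLM in your prepared environment. For best practices on environment setup, refer to [Chapter 2](../ch_2_Environment_Setup/README.md) of this tutorial.\n"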
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.2 Load the Model\n",
    "\n",
    "Now let's load the model. We use [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.\n",
    "\n",
    "### 5.1.2.0 Download Llama 2 (7B)\n",
    "\n",
    "To download [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) from Hugging Face, you need access granted by Meta. Follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.\n",
    "\n",
    "Once access is granted, download the model with your Hugging Face token:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "\n",
    "model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',\n",
    "                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # replace this with your own Hugging Face access token"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> By default, the model is downloaded to `HF_HOME='~/.cache/huggingface'`.\n",
    "\n",
    "### 5.1.2.1 Load the Model in Low Precision\n",
    "\n",
    "A common use case is to load a Hugging Face *transformers* model in low precision, that is, to quantize it **implicitly** at load time.\n",
    "\n",
    "For Llama 2 (7B), simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and specify either `load_in_4bit=True` or the `load_in_low_bit` parameter in the `from_pretrained` function. Compared with the Hugging Face *transformers* API, only minor code changes are required.\n",
    "\n",
    "**For INT4 optimizations (using `load_in_4bit=True`):**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForCausalLM\n",
    "\n",
    "model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\", \n",
    "                                                     load_in_4bit=True)"
   ]
  },
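  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what low-bit quantization does to the weights, the sketch below quantizes a small list of floats to 4-bit integers in two ways: symmetrically (a single scale around zero) and asymmetrically (a scale plus a minimum offset). It is a simplified, whole-list example in plain Python and is not BigDL-LLM's actual implementation; the function names and the sample values are hypothetical.\n",
    "\n",
    "```python\n",
    "def quantize_symmetric(values, bits=4):\n",
    "    # Signed integer range, e.g. [-8, 7] for 4 bits\n",
    "    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1\n",
    "    max_abs = max(abs(v) for v in values)\n",
    "    scale = max_abs / qmax if max_abs > 0 else 1.0\n",
    "    quantized = [max(qmin, min(qmax, round(v / scale))) for v in values]\n",
    "    return quantized, scale\n",
    "\n",
    "\n",
    "def quantize_asymmetric(values, bits=4):\n",
    "    # Unsigned integer range, e.g. [0, 15] for 4 bits\n",
    "    qmax = 2 ** bits - 1\n",
    "    low, high = min(values), max(values)\n",
    "    scale = (high - low) / qmax if high > low else 1.0\n",
    "    quantized = [max(0, min(qmax, round((v - low) / scale))) for v in values]\n",
    "    return quantized, scale, low\n",
    "\n",
    "\n",
    "weights = [0.10, 0.25, 0.40, 0.55, 0.90]  # made-up example values\n",
    "\n",
    "sym_q, sym_scale = quantize_symmetric(weights)\n",
    "sym_restored = [q * sym_scale for q in sym_q]\n",
    "\n",
    "asym_q, asym_scale, asym_low = quantize_asymmetric(weights)\n",
    "asym_restored = [q * asym_scale + asym_low for q in asym_q]\n",
    "\n",
    "# Largest reconstruction error of each scheme\n",
    "sym_err = max(abs(w - r) for w, r in zip(weights, sym_restored))\n",
    "asym_err = max(abs(w - r) for w, r in zip(weights, asym_restored))\n",
    "\n",
    "print('symmetric :', sym_q, f'max error {sym_err:.4f}')\n",
    "print('asymmetric:', asym_q, f'max error {asym_err:.4f}')\n",
    "```\n",
    "\n",
    "For these all-positive sample values, the asymmetric variant spends its 16 levels on the occupied range only, so its reconstruction error is smaller than that of the symmetric variant."
   ]
  },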
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> BigDL-LLM supports `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.\n",
    "\n",
    "**For INT8 optimizations (using `load_in_low_bit=\"sym_int8\"`):**\n",
    "\n",
    "```python\n",
    "# Note that AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model_in_8bit = AutoModelForCausalLM.from_pretrained(\n",
    "    pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\",\n",
    "    load_in_low_bit=\"sym_int8\",\n",
    ")\n",
    "```\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> * Currently, `load_in_low_bit` supports the options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` and `'sym_int8'`, in which 'sym' and 'asym' distinguish symmetric from asymmetric quantization. The option `'nf4'`, that is, 4-bit NormalFloat, is also supported.\n",
    ">\n",
    "> * `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.\n",
    "\n",
    "\n",
    "### 5.1.2.2 Load the Tokenizer\n",
    "\n",
    "A tokenizer is also needed for LLM inference. It encodes input text into tensors that are fed to the LLM, and decodes the LLM's output tensors back into text. You can load the tokenizer directly with the [Hugging Face transformers](https://huggingface.co/docs/transformers/index) API. It works seamlessly with models loaded by BigDL-LLM. For Llama 2, the corresponding tokenizer class is `LlamaTokenizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import LlamaTokenizer\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\")"
   ]
  },
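  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1.2.3 Save & Load the Low-Precision Model (Optional)\n",
    "\n",
    "`from_pretrained` includes a conversion/quantization step, which can be particularly time-consuming or memory-intensive for some large models. To speed this up, after loading the model with `from_pretrained` for the first time, you can store the converted model with the `save_low_bit` API. In later runs, you can use `load_low_bit` instead of `from_pretrained` to load the pre-converted model directly, which skips the conversion step. Saving and loading can be done on different machines.\n",
    "\n",
    "\n",
    "**Save the Low-Precision Model**\n",
    "\n",
    "Take `model_in_4bit` from Section 5.1.2.1 as an example. After loading Llama 2 (7B) in 4-bit precision, we can save the optimized model with the `save_low_bit` function:"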
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_directory = './llama-2-7b-bigdl-llm-4-bit'\n",
    "\n",
    "model_in_4bit.save_low_bit(save_directory)\n",
    "del model_in_4bit"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We recommend saving the tokenizer in the same directory as the optimized model to simplify later loading:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.save_pretrained(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Load the Low-Precision Model**\n",
    "\n",
    "We can load the optimized low-precision model with the `load_low_bit` function, and load the tokenizer from the same directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model_in_4bit = AutoModelForCausalLM.load_low_bit(save_directory)\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.3 Run the Model\n",
    "\n",
    "*transformers* models optimized by BigDL-LLM run much faster than the original models. [Chapter 3: Application Development Basics](../ch_3_AppDev_Basic/) covers the basics of text completion with an optimized model. This section introduces some more advanced usage.\n",
    "\n",
    "### 5.1.3.1 Chat\n",
    "\n",
    "A common application of large language models is the chatbot, in which the LLM takes part in an interactive conversation. There is no magic in chatbot interaction: it still relies on the LLM predicting and generating the next token. To make the LLM chat, we need to format the prompt properly as a conversation, for example:\n",
    "\n",
    "```\n",
    "<s>[INST] <<SYS>>\n",
    "You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\n",
    "<</SYS>>\n",
    "\n",
    "What is AI? [/INST]\n",
    "```\n",
    "\n",
    "In addition, to enable multi-turn chat, you need to append the new input to the previous conversation to build a new prompt for the model, for example:\n",
    "\n",
    "```\n",
    "<s>[INST] <<SYS>>\n",
    "You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\n",
    "<</SYS>>\n",
    "\n",
    "What is AI? [/INST] AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images. </s><s>[INST] Is it dangerous? [/INST]\n",
    "```\n",
    "\n",
    "Now, we show a multi-turn chat example using the official `transformers` API with the BigDL-LLM optimized Llama 2 (7B) model.\n",
    "\n",
    "First, define the conversation context format <sup>[1]</sup> for the model to complete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "SYSTEM_PROMPT = \"You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\"\n",
    "\n",
    "def format_prompt(input_str, chat_history):\n",
    "    prompt = [f'<s>[INST] <<SYS>>\\n{SYSTEM_PROMPT}\\n<</SYS>>\\n\\n']\n",
    "    do_strip = False\n",
    "    for history_input, history_response in chat_history:\n",
    "        history_input = history_input.strip() if do_strip else history_input\n",
    "        do_strip = True\n",
    "        prompt.append(f'{history_input} [/INST] {history_response.strip()} </s><s>[INST] ')\n",
    "    input_str = input_str.strip() if do_strip else input_str\n",
    "    prompt.append(f'{input_str} [/INST]')\n",
    "    return ''.join(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> <sup>[1]</sup> The conversation context format is adapted from [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/model.py#L20) and [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/app.py#L9)."
   ]
  },
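  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how the prompt grows over a conversation, the following self-contained sketch rebuilds the prompt for each turn and reports its size. `build_prompt`, the shortened system prompt, and the sample responses are made up for illustration; the layout follows the Llama 2 chat format described above.\n",
    "\n",
    "```python\n",
    "# Hypothetical, self-contained helper that follows the Llama 2 chat layout\n",
    "SYSTEM_PROMPT = 'You are a helpful assistant.'\n",
    "\n",
    "\n",
    "def build_prompt(user_input, history):\n",
    "    parts = [f'<s>[INST] <<SYS>>\\n{SYSTEM_PROMPT}\\n<</SYS>>\\n\\n']\n",
    "    for past_input, past_response in history:\n",
    "        # Each finished turn is closed with </s> and a new [INST] block is opened\n",
    "        parts.append(f'{past_input.strip()} [/INST] {past_response.strip()} </s><s>[INST] ')\n",
    "    parts.append(f'{user_input.strip()} [/INST]')\n",
    "    return ''.join(parts)\n",
    "\n",
    "\n",
    "history = []\n",
    "turns = [('What is AI?', 'AI is the study of intelligent machines.'),\n",
    "         ('Is it dangerous?', 'It depends on how it is used.')]\n",
    "\n",
    "prompt_lengths = []\n",
    "for user_input, response in turns:\n",
    "    prompt = build_prompt(user_input, history)\n",
    "    prompt_lengths.append(len(prompt))\n",
    "    # One [INST] block per turn so far, including the current one\n",
    "    print(prompt.count('[INST]'), 'turn(s),', len(prompt), 'characters')\n",
    "    history.append((user_input, response))\n",
    "```\n",
    "\n",
    "Because every earlier turn is included again, the prompt gets longer with each turn, which is why long conversations eventually need some form of history management."
   ]
  },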
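  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, define the `chat` function, which keeps appending the model's output to the chat history. This ensures that the conversation context is formatted correctly for generating the next response:"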
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chat(model, tokenizer, input_str, chat_history):\n",
    "    # Format the conversation context, including the chat history, as a prompt\n",
    "    prompt = format_prompt(input_str, chat_history)\n",
    "    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
    "\n",
    "    # Generate the next tokens; generation stops at the end-of-sequence token or after max_new_tokens\n",
    "    output_ids = model.generate(input_ids,\n",
    "                                max_new_tokens=32)\n",
    "\n",
    "    output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip the prompt tokens in the generated output\n",
    "                                  skip_special_tokens=True)\n",
    "    print(f\"Response: {output_str.strip()}\")\n",
    "\n",
    "    # Append the model's output to the chat history\n",
    "    chat_history.append((input_str, output_str))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> Low-precision models optimized by BigDL-LLM are compatible with all Hugging Face *transformers* APIs. Therefore, besides using the `generate` function to predict tokens, you can also use other approaches such as [`TextGenerationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline).\n",
    "\n",
    "Let's have an interactive, multi-turn chat with the LLM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input: What is CPU?\n",
      "Response: Hello! I'm glad you asked! CPU stands for Central Processing Unit. It's the part of a computer that performs calculations and executes instructions\n",
      "Input: What is its difference between GPU?\n",
      "Response: Great question! GPU stands for Graphics Processing Unit. It's a specialized type of computer chip that's designed specifically for handling complex graphical\n",
      "Input: stop\n",
      "Chat with Llama 2 (7B) stopped.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "chat_history = []\n",
    "\n",
    "while True:\n",
    "    with torch.inference_mode():\n",
    "        user_input = input(\"Input:\")\n",
    "        if user_input == \"stop\": # stop the chat when the user enters \"stop\"\n",
    "            print(\"Chat with Llama 2 (7B) stopped.\")\n",
    "            break\n",
    "        chat(model=model_in_4bit,\n",
    "             tokenizer=tokenizer,\n",
    "             input_str=user_input,\n",
    "             chat_history=chat_history)"
   ]
  },
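  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1.3.2 Stream Chat\n",
    "\n",
    "Stream chat can be regarded as an advanced chatbot feature, in which the response is displayed piece by piece as it is generated. Here, we define the `stream_chat` function using `transformers.TextIteratorStreamer`:"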
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from threading import Thread\n",
    "\n",
    "from transformers import TextIteratorStreamer\n",
    "\n",
    "def stream_chat(model, tokenizer, input_str, chat_history):\n",
    "    # Format the conversation context, including the chat history, as a prompt\n",
    "    prompt = format_prompt(input_str, chat_history)\n",
    "    input_ids = tokenizer([prompt], return_tensors='pt')\n",
    "\n",
    "    streamer = TextIteratorStreamer(tokenizer,\n",
    "                                    skip_prompt=True, # skip the prompt in the streamed output\n",
    "                                    skip_special_tokens=True)\n",
    "\n",
    "    # The tokenizer output is dict-like, so its entries become keyword arguments of generate\n",
    "    generate_kwargs = dict(\n",
    "        input_ids,\n",
    "        streamer=streamer,\n",
    "        max_new_tokens=128\n",
    "    )\n",
    "\n",
    "    # Run generation in a separate thread so that the generated text can be read without blocking\n",
    "    thread = Thread(target=model.generate, kwargs=generate_kwargs)\n",
    "    thread.start()\n",
    "\n",
    "    output_str = []\n",
    "    print(\"Response: \", end=\"\")\n",
    "    for stream_output in streamer:\n",
    "        output_str.append(stream_output)\n",
    "        print(stream_output, end=\"\")\n",
    "\n",
    "    # Append the model's output to the chat history\n",
    "    chat_history.append((input_str, ''.join(output_str)))"
   ]
  },
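  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The threading pattern used in `stream_chat` can be tried out without loading a model. The sketch below is a toy stand-in, not the *transformers* implementation: `ToyStreamer` and `fake_generate` are hypothetical names, and the \"model\" simply emits a fixed list of words. A producer thread pushes text chunks into a queue while the main thread iterates over them as they arrive.\n",
    "\n",
    "```python\n",
    "import time\n",
    "from queue import Queue\n",
    "from threading import Thread\n",
    "\n",
    "\n",
    "class ToyStreamer:\n",
    "    # Toy streamer: the producer calls put()/end(), the consumer iterates\n",
    "    _END = object()  # sentinel marking the end of the stream\n",
    "\n",
    "    def __init__(self):\n",
    "        self._queue = Queue()\n",
    "\n",
    "    def put(self, text):\n",
    "        self._queue.put(text)\n",
    "\n",
    "    def end(self):\n",
    "        self._queue.put(self._END)\n",
    "\n",
    "    def __iter__(self):\n",
    "        return self\n",
    "\n",
    "    def __next__(self):\n",
    "        item = self._queue.get()  # blocks until the producer delivers a chunk\n",
    "        if item is self._END:\n",
    "            raise StopIteration\n",
    "        return item\n",
    "\n",
    "\n",
    "def fake_generate(streamer, words, delay=0.01):\n",
    "    # Stand-in for model.generate: emits one word at a time, then signals the end\n",
    "    for word in words:\n",
    "        time.sleep(delay)\n",
    "        streamer.put(word + ' ')\n",
    "    streamer.end()\n",
    "\n",
    "\n",
    "streamer = ToyStreamer()\n",
    "thread = Thread(target=fake_generate,\n",
    "                kwargs=dict(streamer=streamer, words=['Streaming', 'works', 'incrementally.']))\n",
    "thread.start()\n",
    "\n",
    "chunks = []\n",
    "for chunk in streamer:\n",
    "    chunks.append(chunk)\n",
    "    print(chunk, end='', flush=True)  # flush so each chunk appears immediately\n",
    "thread.join()\n",
    "print()\n",
    "```\n",
    "\n",
    "Running the generation in the main thread instead would block until every chunk had been produced, so nothing could be printed in the meantime."
   ]
  },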
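  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> To observe the streaming behavior in standard output, set the environment variable `PYTHONUNBUFFERED=1` so that the standard output stream is sent straight to the terminal instead of being buffered first.\n",
    ">\n",
    "> [Hugging Face *transformers* streamer classes](https://huggingface.co/docs/transformers/main/generation_strategies#streaming) are still under development and may change in the future.\n",
    "\n",
    "Then, as before, we can allow continuous user input to get an interactive, multi-turn stream chat between the human and the bot:"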
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Input: What is AI?\n",
      "Response:  Hello! I'm glad you asked! AI, or Artificial Intelligence, is a field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images, making decisions, and solving problems.\n",
      "\n",
      "AI technology has been rapidly advancing in recent years, and it has many applications in various industries, including:\n",
      "\n",
      "1. Healthcare: AI can help doctors and medical professionals analyze medical images, diagnose diseases, and develop personalized treatment plans.\n",
      "2. Finance: AI\n",
      "Input: Is it dangerous?\n",
      "Response:  As a responsible and ethical AI language model, I must inform you that AI can be both beneficial and potentially dangerous, depending on how it is developed and used.\n",
      "\n",
      "On the one hand, AI has the potential to revolutionize many industries and improve people's lives in many ways, such as:\n",
      "\n",
      "1. Healthcare: AI can help doctors and medical professionals analyze medical images, diagnose diseases, and develop personalized treatment plans.\n",
      "2. Transportation: AI can improve transportation systems by enabling self-driving cars and trucks,\n",
      "Input: stop\n",
      "Stream Chat with Llama 2 (7B) stopped.\n"
     ]
    }
   ],
   "source": [
    "chat_history = []\n",
    "\n",
    "while True:\n",
    "    with torch.inference_mode():\n",
    "        user_input = input(\"Input:\")\n",
    "        if user_input == \"stop\": # stop the chat when the user enters \"stop\"\n",
    "            print(\"Stream Chat with Llama 2 (7B) stopped.\")\n",
    "            break\n",
    "        stream_chat(model=model_in_4bit,\n",
    "                    tokenizer=tokenizer,\n",
    "                    input_str=user_input,\n",
    "                    chat_history=chat_history)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.4 Next Steps\n",
    "\n",
    "In the next section, we will guide you through building a speech recognition pipeline with BigDL-LLM INT4 optimizations."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
